Abstract

We propose a new framework for designing test and query functions for complex structures
that vary across a given parameter such as genetic marker position. The operations we are
interested in include equality testing, set operations, isolating unique states, duplication counting, or
finding equivalence classes under identifiability constraints. A motivating application is locating
equivalence classes in identity-by-descent (IBD) graphs, graph structures in pedigree analysis that
change over genetic marker location. The nodes of these graphs are unlabeled and identified only
by their connecting edges, a constraint easily handled by our approach. The general framework
introduced is powerful enough to build a range of testing functions for IBD graphs, dynamic
populations, and other structures using a minimal set of operations. The theoretical and algorithmic
properties of our approach are analyzed and proved. Computational results on several simulations
demonstrate the effectiveness of our approach.
1 Introduction



In modern genetic analyses, we have genetic marker data $Y_M$ on individuals available at multiple
genetic markers across the genome, and the parameters of genetic marker models, $\Gamma_M$, are generally
well established. By contrast, the models $\Gamma_T$ underlying trait data $Y_T$ are less clear. The goal of genetic
linkage analyses is to locate DNA that affects the trait relative to the known locations of the genetic
marker map. This requires computation of the conditional probability $P(Y_T \mid Y_M; \Gamma)$, where $\Gamma$ is
a joint model comprising $\Gamma_M$ and $\Gamma_T$ and a specification $\gamma$ of the relative genome locations of DNA
underlying $Y_M$ and $Y_T$. This probability will be required for multiple specifications of the location
of DNA affecting the trait, and may be required for multiple values of the trait parameters $\Gamma_T$ and
potentially for multiple trait phenotypes on the same sets of related individuals.
This key probability is most easily considered as

$$P(Y_T \mid Y_M; \Gamma) = \sum_{Z} P(Y_T \mid Z; \gamma, \Gamma_T) \, P(Z \mid Y_M; \Gamma_M) \qquad (1)$$

where $Z$ is a collection of latent variables ensuring the conditional independence of $Y_M$ and $Y_T$ given
$Z$. Classically, the chosen latent variables $Z$ were the unobserved genotypes (types of the DNA) of the
individuals of the pedigree structure [Elston and Stewart, 1971, Lathrop et al., 1984]. More recently
a specification of the inheritance of the DNA at all relevant locations has been the preferred choice
of $Z$ [Lander and Green, 1987, Lange and Sobel, 1991, Thompson, 1994]. On large pedigrees, with
data at multiple marker locations, the probability in (1) cannot be computed exactly, especially if the
data are sparse on the pedigree structures or if these structures are complex. Instead, realizations of
$Z$ from $P(Z \mid Y_M; \Gamma_M)$ are obtained. A Monte Carlo estimate of $P(Y_T \mid Y_M; \Gamma)$ is the mean of
the values of $P(Y_T \mid Z^{(k)}; \gamma, \Gamma_T)$ over the $N$ realized $\{Z^{(k)} : k = 1, \dots, N\}$. Several effective MCMC
methods have been developed to obtain these realizations [Sobel and Lange, 1996, Thompson, 2000,
Tong and Thompson, 2008].

In the context of modern informative marker data, an efficient choice of $Z$ is the pattern of
gene identity by descent (IBD), across the chromosome, among individuals observed for the trait
[Thompson, 2011]. This defines a graph, the IBD graph, which, at each locus, is analogous to the
descent graph of [Sobel and Lange, 1996]. The edges of this graph are the observed individuals, and
the nodes represent IBD sharing at this genome location among the edges (individuals) connecting to
that node. The IBD graph is a deterministic function of the inheritance specification, and for marker
genotypes $Y_M$ observed without error, computation of the probability of these data for a given IBD
graph is easy [Kruglyak et al., 1996, Sobel and Lange, 1996]. Thus use of the IBD graph led to greater
efficiencies in obtaining MCMC realizations from $P(Z \mid Y_M; \Gamma_M)$. However, it has been less well
appreciated that computation of $P(Y_T \mid Z^{(k)}; \gamma, \Gamma_T)$ is also straightforward using the IBD-graph
representation [Thompson, 2003, Thompson and Heath, 1999b].

Use of the IBD graph has other immediate advantages. IBD graphs are generally slowly varying across
the chromosome, relative to modern marker densities, and may be output from the MCMC in compact
format, with only the change points and changes specified. Once the IBD graph is realized, the pedigree
structure is no longer required in subsequent trait-data analyses, providing greater data confidentiality,
and the same set of realized IBD graphs may be used for multiple values of $(\gamma, \Gamma_T)$ and even multiple
different traits observed on the same or different subsets of the individuals [Thompson, 2011]. When
reduced to the subset of individuals observed for a trait, components of an IBD graph $Z$ are generally
small, so that for single-locus models $P(Y_T \mid Z; \gamma, \Gamma_T)$ is very easily computed. In fact, computation
on the joint graphs at several genome locations is also feasible, leading to methods for genetic analysis
under oligogenic models [Su and Thompson, 2012]. Finally, the IBD framework has a key advantage
in that it is not dependent on the source of the inferred IBD.






Using population-based methods, IBD may be inferred between any two individuals not known
to be related [Brown et al., 2012, Browning and Browning, 2010]. If these individuals are pedigree
founders or members of different pedigrees, such population-based IBD may be combined with
pedigree-based inferences of IBD to create merged IBD graphs, providing greater power and resolution
to trait analyses [Glazner and Thompson, 2012].

There is another large computational advantage potentially available from the IBD-graph framework.
In an IBD graph, nodes have an identity only through the edges that connect them. Many
different inheritance patterns S give rise to the same node-unlabelled IBD graph on the subset of
trait-observed individuals. In an MCMC analysis, many different realizations of $Z$ from $P(Z \mid Y_M; \Gamma_M)$
may give the same IBD graph. Additionally, because IBD graphs are generally slowly varying, a given
realized IBD graph may remain constant over several Mbp. Clearly, $P(Y_T \mid Z; \gamma, \Gamma_T)$ should be
computed only once for each distinct $Z$. Recognizing when IBD graphs are equal, and the marker
ranges over which they are equal, is crucial to efficient trait-data analyses. The software developed
in this paper performs this task efficiently, and can decrease the burden of the trait-data probability
portion of the LOD score estimation procedure by up to two orders of magnitude in real studies
[Marchani and Wijsman, 2011].
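The saving described above can be illustrated with a minimal Python sketch. It assumes that each realized IBD graph has already been summarized by a hashable key such that equal graphs share a key. The names `monte_carlo_estimate` and `toy_trait_probability`, and the toy probabilities, are hypothetical and are not part of the software described in this paper.

```python
from collections import Counter

def monte_carlo_estimate(realized_keys, trait_probability):
    """Average trait_probability over all realizations.

    realized_keys: one hashable graph key per MCMC realization (hypothetical input).
    trait_probability: function returning P(Y_T | Z) for the graph behind a key.
    The expensive function is evaluated once per distinct key, not once per realization.
    """
    counts = Counter(realized_keys)
    total = sum(count * trait_probability(key) for key, count in counts.items())
    return total / len(realized_keys)

# Toy data: four realizations, only two distinct graphs.
calls = []
toy_probabilities = {"graph-1": 0.2, "graph-2": 0.6}

def toy_trait_probability(key):
    calls.append(key)  # record each expensive evaluation
    return toy_probabilities[key]

estimate = monte_carlo_estimate(
    ["graph-1", "graph-2", "graph-1", "graph-1"], toy_trait_probability
)
print(len(calls))  # 2 evaluations for 4 realizations
```

The weighted sum gives the same mean as evaluating every realization separately; only the number of trait-probability evaluations changes.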



The key to our approach is to represent the object properties relevant to the testing by sets of
representative hashes instead of the objects themselves. Hashes permit much faster algorithms in
many cases; for example, testing whether two graphs are equal can be done by checking whether two
hashes are equal. These hashes are strong in the sense that intersections - unequal objects or processes
mapping to identical hashes - are so unlikely as to never occur in practice. Furthermore, we introduce
several provably strong operations on such hashes that allow accurate reductions of collections while
maintaining specified relationships between the hashes. Thus in our approach, designing test functions
is equivalent to designing composite hash functions accepting one or more input objects and returning
a representative hash. Testing equality over collections of input objects is then equivalent to testing
equality of the output hashes; set operations over object collections are equivalent to set operations
over the hashes, and so on.

We allow the objects in our framework to change over an indexing parameter. We refer to this
index as a marker, as it refers to genetic marker position in our target application, but it could just as
easily refer to time or any other indexing parameter. The power of this framework is that the building
block operations process along all the possible marker values; the difficulties introduced by dynamic
data are abstracted away.

The running example we use to illustrate this framework is identity-by-descent graphs, or IBD
graphs [Sobel and Lange, 1996, Thompson and Heath, 1999a]. As an abstracted structure, these
graphs have two interesting and distinctive properties. First, only the links are identifiable. In
other words, equality on the graph structure is done strictly over links and the set of links attached to
each node. Second, these graphs change over marker index; one or more links may be in different
configurations between distinct marker points. These graphs can be arbitrarily large, with an arbitrarily
large range of marker values over which links can change location, so brute force equality testing at
specific marker values quickly becomes infeasible. The computational problems are exacerbated when
one wishes to work over large collections of these, matching graph structures and looking for patterns.

To set up this example, consider the graph shown in Figure 1. Snapshots of the graph $G$ are shown
at three marker values, $m_1 < m_2 < m_3$. Now consider $G(m_1)$ in Figure 1a. As the nodes and links
all have distinct labels, it can be represented by either listing the edges connected to each node or the
nodes connected to each edge, as shown in Tables 1a and 1b respectively. Note that the structure of
the graph is uniquely described by the sets in the right-hand column of either table. This allows us
to test a graph structure without considering labels on the nodes but only on the edges, as we do for
IBD graphs. For equality testing purposes, this graph can be represented exactly by first computing
a hash over each connecting set, then over the set of resulting hashes; this is essentially what we do.

(a) $G(m_1)$    (b) $G(m_2)$    (c) $G(m_3)$

Figure 1: An example IBD graph $G[m]$ with labels on both edges and nodes. The nodes (numbered) represent
genetic sequences, while the edges (lettered) represent individuals. The graph changes slightly by marker value
and is shown at marker values $m_1$, $m_2$, and $m_3$. Note that under the identifiability constraints for IBD graphs,
(a) and (c) are distinct despite having the same skeletal structure.
Table 1: (a, b) Two representations of the graph $G(m_1)$ from Figure 1a, by node (a) and by edge (b). (c)
The same graph $G$ at all marker positions with validity sets representing the differences in the graph at marker
values $m_1 < m_2 < m_3$.



To extend this to dynamic graphs, we can associate validity information with the components of
the graphs. Figure 1 shows $G$ at marker locations $m_1 < m_2 < m_3$, with slight but significant changes
between them. Restricting ourselves to the by-nodes representation in Table 1a, as
the other is analogous and this one is appropriate for IBD graphs, we can describe $G$ by Table 1c.
This produces a collection of sets that varies by marker value; this example will be explored further
throughout this paper since working with dynamic collections such as this one is the target application
of our framework.

The paper is structured as follows. In the next section we briefly describe some related work, mostly
involving innovative uses of non-intersecting hashes. In section 3, we formalize what we mean by a
hash and describe a set of basic functions over them with theoretic guarantees. Then, in section 4, we
extend our theory of hashes to include marked hashes by describing marker validity sets and the data
structures we use to make marker operations efficient. Section 5 details M-Sets, our most significant
contribution, along with available operations. Finally, in section 6, we illustrate the flexibility of our
approach with several examples.
2 Related Work



Hash functions and related algorithms have seen numerous applications. The unifying principle is 
that a short "digest" is calculated over the message or data in such a way that changes in the data 
are reflected, with sufficiently high probability, in the digest. 

Arguably the most widespread use of hash-like algorithms is with checksums and cyclic
redundancy checks (CRCs) [Maxino and Koopman, 2009, Nakassis, 1988, Peterson and Brown]. These are
used to verify data integrity in everything from file systems to Internet transmission protocols, and
are usually 32 or 64 bit and designed to detect random errors. The checksum is stored or transmitted
along with the data. When the data is read or received, the checksum is recalculated; if it doesn't
match the original checksum, it is assumed that an error occurred.
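As a small illustration of the store-then-recompute procedure just described, the following sketch uses the CRC-32 routine from Python's standard `zlib` module. The payload strings are made up for the example.

```python
import zlib

payload = b"identity by descent"
stored_checksum = zlib.crc32(payload)  # stored or transmitted with the data

received_ok = b"identity by descent"
received_bad = b"identity by dissent"  # two corrupted bytes

def is_intact(data: bytes, checksum: int) -> bool:
    """Recompute the CRC-32 and compare it with the stored value."""
    return zlib.crc32(data) == checksum

print(is_intact(received_ok, stored_checksum))   # True
print(is_intact(received_bad, stored_checksum))  # False
```

A 32-bit checksum of this kind targets accidental errors only; it is not strong in the sense required later in this paper.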

In cryptography, hashes, or "message digests", give a signature of a message without revealing any
information about the message itself [Schneier, 2007]. For example, it is common to store passwords
in terms of a hash; it is computationally infeasible to deduce the password from the hash, but easy to check for
a password match. Much like a checksum, a digest is also used to ensure messages have not been tampered
with, as it is extremely difficult to produce different messages that have the same digest. It is this
property that we utilize in our approach.

Representing data by a hash is also common. Hash tables, a data structure for fast lookup of
objects given a key, work by first creating a hash of the key and using that hash to index a location
in an array in which to store the object [Cormen et al., 2001]. The hashes used in such tables are
usually weak, as calculating the hash is a significant efficiency bottleneck and the size of the lookup
array determines how many bits of the hash are actually needed - usually not all. Collisions - distinct
objects mapping to the same hash - may be common, so further equality testing is performed to
ensure the indexing keys match. Thus such hash tables tend to be relatively complicated structures.

Stronger hash functions usually produce hashes with 128 or more bits, large enough that the
probability of collisions is so low as to never occur in practice. Database applications often use such hashes
to index large files, as the non-existence of collisions greatly simplifies processing [Silberschatz et al.,
1997]. Similarly, network applications often use such hashes to cache files - files having the same hash
do not need to be retransmitted [Barish and Obraczka, 2000, Karger et al., 1999, Wang, 1999]. Often
cryptographic hash functions are used for this purpose; while slower, they are computed without using
network resources, so calculating them is not the main efficiency bottleneck. Furthermore, they are
strong enough that hash equality essentially guarantees object equality.

Our application extends several of these ideas, most notably the last one. We use strong hash
functions to represent arbitrary objects in our framework, assuming equalities among hashes are
trustworthy. However, our framework extends previous work in that it relies heavily on several
operations over hash values to reduce the information present in collections down to a single hash that is
invariant to specified aspects of a process. We present several theorems that guarantee the summary
hash is also strong. This allows us to reduce computations that would be complex when performed
over the original data structures to simple operations over hashes while ensuring that the results are
accurate.



3 Hashes 

For our purpose, the hash function Hash maps from an arbitrary object or other hash to an integer
in the set $\mathcal{H}_N = \{0, 1, \dots, N-1\}$ in a way that satisfies several properties. First, such a function must
be one-way; i.e. no information about the original object can be readily deduced from the hash (e.g.
"abcdef" and "Abcdef" map to unrelated hashes). In other words, having access to the output of
such a hash function is equivalent to having access only to an oracle function that returns true if a
query object is equal to the original object and, with high probability, false otherwise [Canetti,
1997, Canetti et al., 1998].



Under these requirements, Hash can be seen as a discrete, uniformly distributed random variable
mapping from the event space - arbitrary objects or other hashes, etc. - to $\mathcal{H}_N$. This form allows
us to assume $\mathrm{Hash}(\omega)$ has a uniform distribution on $\mathcal{H}_N$ for an arbitrary object $\omega$, a form useful for
the proofs we give later on. Furthermore, the "oracle" property implies that the distributions of two
hashes are independent if the indexing objects are distinct.

Second, Hash must be strong, i.e. collisions - unequal objects mapping to the same hash - are
extremely improbable. Formally,

Definition 3.1 (Strong Hash Function). A hash function Hash mapping from an arbitrary object
$\omega$ to an integer in $\mathcal{H}_N = \{0, 1, \dots, N-1\}$ is considered strong if, for $h_1 = \mathrm{Hash}(\omega_1)$ and
$h_2 = \mathrm{Hash}(\omega_2)$,

$$P(h_1 = h_2) = \begin{cases} 1 & \text{if } \omega_1 = \omega_2 \\ 1/N & \text{otherwise} \end{cases} \qquad (2)$$

The idea is to set $N$ large enough (in our case around $10^{38}$) that the probability of two unequal
objects yielding the same hash is so low as to never occur in practice. However, in light of the fact
that collisions can theoretically occur with nonzero probability, we denote inequality as $\nsim$ instead of
$\neq$; specifically, if $h_1 = \mathrm{Hash}(\omega_1) \nsim \mathrm{Hash}(\omega_2) = h_2$, then $P(h_1 = h_2) = 1/N$.
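To make the definition concrete, here is a minimal Python sketch of a hash into $\mathcal{H}_N$. It is built on plain MD5 from the standard `hashlib` module rather than the tweaked MD5 variant used in this paper, and the modulus $N = 2^{127} - 1$ (a Mersenne prime of about $1.7 \times 10^{38}$) is an assumption made for illustration only. The function name `strong_hash` is hypothetical.

```python
import hashlib

# Assumption: N = 2**127 - 1 stands in for the paper's modulus.
N = 2**127 - 1

def strong_hash(label: str) -> int:
    """Map a string label to an integer in {0, ..., N-1} via MD5.

    Reducing a 128-bit digest modulo N is very slightly non-uniform;
    that is ignored in this sketch.
    """
    digest = hashlib.md5(label.encode("utf-8")).digest()  # 16 bytes
    return int.from_bytes(digest, "big") % N

h1 = strong_hash("abcdef")
h2 = strong_hash("Abcdef")

print(h1 == strong_hash("abcdef"))  # equal objects always give equal hashes
print(h1 == h2)                     # unequal objects: equal with probability about 1/N
```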

We may assume the existence of such a hash function, denoted here as Hash, which maps any
possible input - strings, numbers, other hashes - to a hash that satisfies definition 3.1. This assumption
is reasonable, as significant research in cryptography has gone towards developing hash functions that
not only satisfy definition 3.1 but also prevent adversaries with large amounts of computing power
from deducing any information about the original object [Goldreich, 2001, Schneier, 2007]. These
hash functions are widely available and have open specifications; we use a tweaked version of the
well-known MD5 hash function as outlined in Appendix A.



3.1 Hash Operations 

Based on the existence of a hash function Hash, and simple operations on integers in $\mathcal{H}_N$, we propose
two basic operations to combine and modify hashes. The first is a way to summarize an unordered 
collection of hashes by reducing it to a single hash that is sensitive to changes in the hash value of 
any key in the original collection. The second, to be used in nested function compositions, is a way 
to scramble a reduced hash value so that it locks invariance properties present earlier in the function 
composition. In this section, we formally describe these operations, which will later be generalized to 
both marked hashes and then to collections of marked hashes. 



3.1.1 Transformations 

We now must formalize what we mean by a transformation in the testing function context. In our 
terminology, a transformation always applies to the inputs of a testing function and is done without 
regard to the hash values themselves. For example, any reordering of the input values is a valid
transformation, but appending a precomputed string to the label of an input object to cause its hash
to be the special null hash - all zeros - is not. (Note, however, that forming the null hash in this way
is near-impossible in practice.) Formally,

Definition 3.2 (Transformation Classes). A transformation class $\mathcal{T}$ for a testing function satisfies:

T1. Every $T \in \mathcal{T}$ can be expressed as a transformation of the non-hash input objects.

T2. No $T \in \mathcal{T}$ takes account of the specific hash values produced by these objects.

Given these restrictions on transformation classes, we can now formally define what we mean by 
invariance. 

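Definition 3.3 (Invariance and Distinguishing). A function $f : \mathcal{H}_N^n \mapsto \mathcal{H}_N$ accepting a set of inputs
$\mathbf{h} = (h_1, h_2, \dots, h_n)$ is invariant under a class of transformations $\mathcal{T}$ if $f(T\mathbf{h}) = f(\mathbf{h})$ for all $T \in \mathcal{T}$
and for all $\mathbf{h} \in \mathcal{H}_N^n$. Likewise, $f$ distinguishes $\mathcal{T}$ if, for $T_1, T_2 \in \mathcal{T}$, $f(T_1\mathbf{h}) \nsim f(T_2\mathbf{h}) \nsim f(\mathbf{h})$ unless
$T_1\mathbf{h} = \mathbf{h}$, $T_2\mathbf{h} = \mathbf{h}$ or $T_1\mathbf{h} = T_2\mathbf{h}$.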

In other words, the output hashes change under distinguishing transformations and are constant 
under invariant transformations. With these formal definitions, we are now prepared to define atomic 
operations that have specific and provable invariance properties. 

3.1.2 The Null Hash 

We choose one value in our hash set, specifically 0, to represent a null hash. This hash value, denoted
as $\emptyset$, is used to represent the absence of an input object. It most commonly represents the hash of an
object that is outside its marker validity set; this will be detailed more in section 4. As such, it has
special properties with the hash operations outlined in the next section.

3.2 Hash Operation Properties 

We here propose two basic operations, Reduce and Rehash. The first reduces a collection of $n$
hashes, $h_1, h_2, \dots, h_n$ for $n = 1, 2, \dots$, down to a single hash, while the second rehashes a single input
hash to prevent invariance properties from propagating further through a function composition. The
key aspects of these operations are the transformations over the inputs under which they are invariant; we
describe these next. We follow this with a brief discussion of the implications of these results, before
detailing the construction of such functions in section 3.3.

Definition 3.4 (Reduce). For the Reduce function, with $h_0 = \mathrm{Reduce}(h_1, \dots, h_n)$, we have the
following properties:

RD1. Invariance Under the null hash $\emptyset$. The output hash $h_0$ is invariant under input of the null hash
$\emptyset$. Specifically,

    $\mathrm{Reduce}(h_1, \emptyset) = \mathrm{Reduce}(h_1)$
    $\mathrm{Reduce}(\emptyset) = \emptyset$

RD2. Invariance Under Single Mapping. The output hash $h_0$ equals the input hash $h_1$ if $n = 1$.
Specifically,

    $\mathrm{Reduce}(h_1) = h_1$

RD3. Existence of a negating hash. There exists a negating hash, here labeled $-h_1$, that cancels the
effect of an input hash in the sense that

    $\mathrm{Reduce}(h_1, -h_1) = \mathrm{Reduce}(\emptyset) = \emptyset$

RD4. Order Invariance. The output hash $h_0$ is invariant under different orderings of the input.
Specifically,

    $\mathrm{Reduce}(h_1, h_2) = \mathrm{Reduce}(h_2, h_1)$

RD5. Invariance Under Composition. The output hash $h_0$ is invariant under nested compositions of
Reduce. Specifically,

    $\mathrm{Reduce}(h_1, h_2, h_3) = \mathrm{Reduce}(h_1, \mathrm{Reduce}(h_2, h_3)) = \mathrm{Reduce}(\mathrm{Reduce}(h_1, h_2), h_3)$

RD6. Strength. The Reduce function distinguishes all other transformations in the sense of definition
3.3 (e.g. an input value is dropped or changed).

Two remarks are in order. First, property RD5 allows us to expand all nestings of Reduce to a
single function of hashes that are not the output of Reduce operations. For example,

    $\mathrm{Reduce}(h_1, \mathrm{Reduce}(h_2, \mathrm{Reduce}(h_3, h_4))) = \mathrm{Reduce}(h_1, h_2, h_3, h_4)$

The implication is that when we have a collection of input hashes $h_1, h_2, \dots, h_n$, which may or may not
have come from a reduction themselves, we can write their reduction out as a single reduction; i.e.

    $\mathrm{Reduce}(h_1, h_2, \dots, h_n) = \mathrm{Reduce}(h'_1, h'_2, \dots, h'_{n'})$

for some hashes $h'_1, h'_2, \dots, h'_{n'}$ that are not the result of Reduce.

Second, property RD3 allows us to remove elements from a reduction once they are added, making
the reduced hash invariant under changes in whatever process produced the canceled hash. This
property becomes especially useful later on when working with intervals; the output hash varies as a
function of a marker value, and a hash $h$ valid on an interval $[a, b)$ of that marker can be added once
at $a$ and removed at $b$, with the net result being that the output hash is only sensitive to $h$ on $[a, b)$.
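The add-at-$a$, remove-at-$b$ idea can be sketched in Python as a sweep over change points. This is only an illustration under stated assumptions: Reduce is taken to be addition modulo a prime $N$ (the construction established in section 3.3), $N = 2^{127} - 1$ is an assumed modulus, the function name `reduce_at_change_points` is hypothetical, and the small integers 11 and 22 stand in for real hash values.

```python
N = 2**127 - 1  # assumed prime modulus, for illustration only

def reduce_at_change_points(keys):
    """keys: iterable of (hash, a, b), each hash valid on [a, b).

    Returns a sorted list of (marker, reduced_hash); each reduced hash
    is in effect from its marker up to the next change point.
    """
    deltas = {}
    for h, a, b in keys:
        deltas[a] = (deltas.get(a, 0) + h) % N        # add h at a
        deltas[b] = (deltas.get(b, 0) + (N - h)) % N  # add the negating hash at b
    running, result = 0, []
    for marker in sorted(deltas):
        running = (running + deltas[marker]) % N
        result.append((marker, running))
    return result

keys = [(11, 0, 10), (22, 5, 15)]
print(reduce_at_change_points(keys))
# [(0, 11), (5, 33), (10, 22), (15, 0)]
```

Each key contributes to exactly two change points, so the reduced hash over the whole marker range is obtained without recomputing the reduction from scratch at every marker value.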

Definition 3.5 (Rehash). For the Rehash function, we have only two properties, which we list
here.

RH1. Invariance Under the null hash $\emptyset$. The output hash $h_0$ is invariant under input of the null hash
$\emptyset$. Specifically,

    $\mathrm{Rehash}(\emptyset) = \emptyset$

RH2. Strength. The Rehash function is strong in the sense of definition 3.1, in which the object space
is restricted to hash keys.

The purpose of the Rehash function is to freeze invariance patterns, preventing them from propagating through
multiple compositions. Returning to the example IBD graph in Figure 1, consider the testing function
shown in Figure 2. The function resulting from chaining Reduce and Rehash together as shown is
invariant under changes in the node labels or orderings of the edges within each node, but is sensitive
to any structural change in the graph. This can be proved by decomposing the nested Reduce
functions into a single function of the first group; all the invariant relationships of this single function
are present in the original. However, the final reduce cannot be decomposed this way on account of
the Rehash functions.
Figure 2: Part of a composite function for testing equality, ignoring node labels, of the graph $G(m_1)$ in
Figure 1. The part of the function for the first 3 nodes is shown. For each node, the labels on the connected
edges are reduced into one hash, and then the hashes representing the nodes are reduced to a final hash which
represents the graph. This final hash is invariant under changes in the node labels or orderings of the edges
within each node, but is sensitive to any structural change in the graph.
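The composite function of Figure 2 can be sketched in Python as follows. This is a hedged illustration, not the implementation described in this paper: it uses plain MD5 with an assumed prime modulus $N = 2^{127} - 1$, takes Reduce and Rehash in the forms constructed in section 3.3, and runs on a small made-up graph rather than the graph of Figure 1. All function names are hypothetical.

```python
import hashlib

N = 2**127 - 1  # assumed prime modulus, for illustration only

def _md5_mod(data: bytes) -> int:
    return int.from_bytes(hashlib.md5(data).digest(), "big") % N

def hash_label(label: str) -> int:
    return _md5_mod(label.encode("utf-8"))

def reduce_hashes(hashes) -> int:
    return sum(hashes) % N

def rehash(h: int) -> int:
    # Hash(h) minus Hash(null hash), so that the null hash maps to itself.
    return (_md5_mod(h.to_bytes(16, "big")) - _md5_mod((0).to_bytes(16, "big"))) % N

def graph_hash(nodes) -> int:
    """nodes: mapping from node label to the edge labels attached to that node."""
    return reduce_hashes(
        rehash(reduce_hashes(hash_label(e) for e in edges)) for edges in nodes.values()
    )

def graph_hash_without_rehash(nodes) -> int:
    return reduce_hashes(
        reduce_hashes(hash_label(e) for e in edges) for edges in nodes.values()
    )

g = {1: ["A", "B"], 2: ["B", "C", "D"], 3: ["A", "C"], 4: ["D"]}
# Same structure: nodes renamed and reordered, edges listed in another order.
relabeled = {"x": ["B", "A"], "w": ["D"], "y": ["D", "C", "B"], "z": ["C", "A"]}
# Structural change: one end of edge D moves from node 2 to node 3.
changed = {1: ["A", "B"], 2: ["B", "C"], 3: ["A", "C", "D"], 4: ["D"]}

print(graph_hash(g) == graph_hash(relabeled))  # True
print(graph_hash(g) == graph_hash(changed))    # False, up to a 1/N collision
print(graph_hash_without_rehash(g) == graph_hash_without_rehash(changed))  # True
```

The last line shows why Rehash is needed: without it the nested reductions flatten into a single sum over edge labels, which cannot see which edges share a node.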



3.3 Function Construction and Implementation 

We now establish that functions Reduce and Rehash satisfying the appropriate properties exist.
Along with this comes a requirement on $N$, namely that it is prime; this is required to preserve the
strength of the Reduce function under multiple reductions of the same hash key.

3.3.1 Basic Operations 

Before presenting the Reduce and Rehash functions, we first present two lemmas from elementary 
number theory. These lemmas provide the theoretical basis of the Reduce function. 

Lemma 3.6. Let $N$ be prime, and let $\oplus$ and $\otimes$ denote addition and multiplication modulo $N$,
respectively. Suppose $a, b \in \mathcal{H}_N$. Then the following equivalences hold modulo $N$:

    $(a \otimes b) \oplus b = (a \oplus 1) \otimes b$    (3)
    $-(a \oplus b) = (-a \oplus -b)$
    $-(a \otimes b) = (-a \otimes b) = (a \otimes -b)$

Proof. The integers modulo $N$ form an algebraic field with distributivity of multiplication over
addition, so (3) is trivially satisfied. Furthermore,

    $-h = (-1) \otimes h = (N - 1) \otimes h \mod N$

Thus $-(a \oplus b) \mod N = (-a \oplus -b) \mod N$. Similarly, multiplication is commutative, so

    $-(a \otimes b) = ((-1) \otimes a) \otimes b = (-a \otimes b) \mod N$
    $= (a \otimes ((-1) \otimes b)) = (a \otimes -b) \mod N$

The lemma is proved. □
Lemma 3.7. Let $N$ be prime, and let $X$ and $Y$ be independent random variables with distribution
$\mathrm{Un}(\mathcal{H}_N)$, i.e. uniform over $\mathcal{H}_N$, and let $c$ and $r$ be any numbers in $\mathcal{H}_N$ with $r \neq 0$. Then

    $c \oplus X \sim \mathrm{Un}(\mathcal{H}_N)$    (4)

    $-X \mod N = N - X \sim \mathrm{Un}(\mathcal{H}_N)$    (5)

    $X \oplus Y \sim \mathrm{Un}(\mathcal{H}_N)$    (6)

    $r \otimes X \sim \mathrm{Un}(\mathcal{H}_N)$,    (7)

i.e. the above are all uniformly distributed on $\mathcal{H}_N$.

Proof. For (4), note that addition modulo a constant is a one-to-one automorphic map on the hash
space, thus every mapped number is equally likely. (5) is similarly proved. To prove (6), note that
$Y$ can be seen as a similar random mapping; however, every possible mapping produces the same
distribution over $\mathcal{H}_N$, so $X \oplus Y$ has the same distribution as $X \oplus Y \mid Y$, which, by (4), is uniform on
$\mathcal{H}_N$.

For (7), recall from number theory that $r$ has an inverse mod $N$ if and only if $r$ and $N$ are coprime,
i.e. $\gcd(r, N) = 1$. Thus if $N$ is prime, each nonzero $r$ in $\mathcal{H}_N$ also indexes a one-to-one automorphic map under
$r \otimes X$, and the result immediately follows. □
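The role of primality in (7) can be checked by brute force on small moduli. The following Python sketch is illustrative only; the moduli 7 and 8 and the function name `image_of_multiplication` are chosen for the example.

```python
def image_of_multiplication(r, n):
    """Return the set {r * x mod n : x in 0..n-1}."""
    return {(r * x) % n for x in range(n)}

# Prime modulus: multiplying by any nonzero r permutes the residues,
# so a uniform X stays uniform.
prime = 7
all_bijective = all(
    len(image_of_multiplication(r, prime)) == prime for r in range(1, prime)
)
print(all_bijective)  # True

# Composite modulus: r = 2 shares a factor with 8 and collapses the
# eight residues onto four values, so uniformity is lost.
print(sorted(image_of_multiplication(2, 8)))  # [0, 2, 4, 6]
```

This is the situation that arises when the same hash enters a reduction several times: the repeated hash contributes a multiple $r \otimes X$, which remains uniform only if multiplication by $r$ is one-to-one.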

3.3.2 Reduce 

We are now ready to tackle Reduce; if $N$ is prime, then addition modulo $N$ satisfies all the required
properties. This operation is similar to part of the Adler-32 checksum algorithm, a variant of the
Fletcher checksum that uses addition modulo a 16-bit prime, for the reasons outlined in lemma 3.7.

Theorem 3.8 (The Reduce Function). Suppose $N$ is prime. Then the function $f : \mathcal{H}_N^n \mapsto \mathcal{H}_N$
defined by

    $f(h_1, h_2, \dots, h_n) = h_1 + h_2 + \dots + h_n \mod N$
    $= h_1 \oplus h_2 \oplus \dots \oplus h_n$,

where $\oplus$ denotes addition modulo $N$, satisfies properties RD1-RD6.

Proof. Addition modulo $N$, with $N$ prime, forms an algebraic group, so properties RD1, RD2, RD4,
and RD5 are trivially satisfied. RD3 is satisfied with $-h_1 = N - h_1 \mod N$.

To prove RD6, it is sufficient to verify equation (2) in definition 3.1. Let $h_0 = \mathrm{Reduce}(h_1, h_2, \dots, h_n)$,
and let $k_0 = \mathrm{Reduce}(k_1, k_2, \dots, k_m)$. Without loss of generality, by the previous properties, let the
sequences be as follows:

1. No hash in either sequence is the negative of another hash in that sequence.

2. $n \geq m$; if not, swap sequences.

3. There exists an index $q \in \{1, 2, \dots, m\}$ such that $h_i = k_i$ for $i \leq q$ and $h_i \notin \{k_{q+1}, \dots, k_m\}$ for
$i = q + 1, \dots, n$.

4. $h_1 = k_1 = \emptyset$ (so $q \geq 1$, to make bookkeeping easier).

Now suppose the two sequences are identical, so $n = m = q$. Then $h_0 = k_0$ and we are done; this
satisfies the first part of equation (2). Otherwise, we can use lemma 3.6 to represent $h_0$ and $k_0$ as

    $h_0 = Q \oplus (\alpha_1 \otimes H_1) \oplus (\alpha_2 \otimes H_2) \oplus \dots \oplus (\alpha_{n'} \otimes H_{n'})$
    $k_0 = Q \oplus (\beta_1 \otimes K_1) \oplus (\beta_2 \otimes K_2) \oplus \dots \oplus (\beta_{m'} \otimes K_{m'})$

where $Q = \mathrm{Reduce}(h_1, h_2, \dots, h_q)$ and $H_1, H_2, \dots, H_{n'}, K_1, K_2, \dots, K_{m'}$ are all independent and
$\alpha_1, \dots, \alpha_{n'}, \beta_1, \dots, \beta_{m'}$ denote the multiplicity of each hash. It remains to show that
$P(h_0 = k_0 \mod N) = 1/N$. Now

    $P(h_0 = k_0 \mod N) = P(h_0 \oplus -k_0 = 0)$

and, using lemma 3.6 to distribute the minus signs and eliminate the $Q$s,

    $h_0 \oplus -k_0 = (\alpha_1 \otimes H_1) \oplus (\alpha_2 \otimes H_2) \oplus \dots \oplus (\alpha_{n'} \otimes H_{n'})$
    $\oplus (\beta_1 \otimes -K_1) \oplus (\beta_2 \otimes -K_2) \oplus \dots \oplus (\beta_{m'} \otimes -K_{m'})$.

However, applying lemma 3.7 inductively gives that the distribution of the above is uniform over $\mathcal{H}_N$.
Thus $P(h_0 = k_0) = 1/N$. □
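The deterministic properties RD1-RD5 of this construction are easy to check numerically. The Python sketch below assumes the modulus $N = 2^{127} - 1$ (a Mersenne prime, chosen here only for illustration) and uses arbitrary integers in place of real hash values; the function names are hypothetical. RD6 is a probabilistic statement and is not checked.

```python
N = 2**127 - 1  # assumed prime modulus, for illustration only
NULL = 0        # the null hash

def reduce_hashes(*hashes):
    """Reduce: sum of the input hashes modulo N."""
    return sum(hashes) % N

def negate(h):
    """Negating hash: N - h mod N."""
    return (N - h) % N

h1, h2, h3 = 12345, 67890, 13579  # arbitrary stand-ins for real hash values

assert reduce_hashes(h1, NULL) == reduce_hashes(h1)      # RD1
assert reduce_hashes(NULL) == NULL                       # RD1
assert reduce_hashes(h1) == h1                           # RD2
assert reduce_hashes(h1, negate(h1)) == NULL             # RD3
assert reduce_hashes(h1, h2) == reduce_hashes(h2, h1)    # RD4
assert reduce_hashes(h1, h2, h3) == reduce_hashes(h1, reduce_hashes(h2, h3))  # RD5
assert reduce_hashes(h1, h2, h3) == reduce_hashes(reduce_hashes(h1, h2), h3)  # RD5
```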

3.3.3 Rehash 

Now on to Rehash, which is far simpler as it relies mainly on the property of the hash function being 
strong. The only extra work is to ensure that the null hash is preserved. 

Theorem 3.9 (Rehash). The function $\mathrm{Rehash} : \mathcal{H}_N \mapsto \mathcal{H}_N$ defined by

    $\mathrm{Rehash}(h) = \mathrm{Reduce}(\mathrm{Hash}(h), -\mathrm{Hash}(\emptyset))$,

satisfies properties RH1 and RH2 while distinguishing all other transformations.

Proof. Follows trivially from the properties of Reduce and the assumption that Hash is strong and
one-way. □
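A minimal Python sketch of this construction follows. It assumes plain MD5 over the 16-byte big-endian encoding of the input hash and the modulus $N = 2^{127} - 1$; both choices, and the function names, are illustrative assumptions rather than the paper's tweaked MD5 variant.

```python
import hashlib

N = 2**127 - 1  # assumed prime modulus, for illustration only
NULL = 0        # the null hash

def hash_int(h: int) -> int:
    """Hash an element of H_N by applying MD5 to its 16-byte encoding."""
    digest = hashlib.md5(h.to_bytes(16, "big")).digest()
    return int.from_bytes(digest, "big") % N

def rehash(h: int) -> int:
    """Rehash(h) = Reduce(Hash(h), -Hash(NULL)) = Hash(h) - Hash(NULL) mod N."""
    return (hash_int(h) - hash_int(NULL)) % N

a, b, c = 12345, 67890, 13579  # arbitrary stand-ins for real hash values

print(rehash(NULL) == NULL)  # True: RH1, the null hash is preserved
# Rehashing an intermediate reduction stops it from flattening (compare RD5):
print((rehash((a + b) % N) + c) % N == (a + b + c) % N)  # False, up to a 1/N collision
```

Subtracting Hash of the null hash is what pins the null hash to itself; every other input is scrambled by the underlying hash function.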

In this section, we have presented the fundamental building blocks regarding hashes. We now 
augment these hash values with validity information that varies as a function of a particular parameter, 
here called a marker value. 



4 Hashes and Keys 

We define a key as a hash value associated with a set of intervals within which that hash, or the object
it refers to, is valid. A key may represent an object in the data structure we wish to design a testing
function for, e.g. an edge in a graph that is present only for certain marker values, or it may represent
the result of a process or sub-process. At the marker values for which this key is not valid, we assume
its hash is equal to $\emptyset$. To denote the hash value of any key at a certain value $p$ of the parameter space,
we use brackets - e.g. $h[p]$.

The set on which the hash value of a key is valid, which we call a marker validity set or just validity
set, is a sequence of sorted, disjoint intervals of the form $[a_i, b_i) \subset [-\infty, \infty)$. A marked object is valid
in each of these intervals and invalid elsewhere. Saying something is unmarked is equivalent - for
bookkeeping reasons - to saying that it is always valid, i.e. on the interval $[-\infty, \infty)$.
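A key and its validity set can be sketched as a small Python class. The class name `Key`, its fields, and the binary-search lookup are assumptions made for illustration; the data structures actually used in this paper may differ.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple

NULL = 0  # the null hash

@dataclass
class Key:
    hash_value: int
    intervals: List[Tuple[float, float]]  # sorted, disjoint half-open intervals [a_i, b_i)

    def is_valid(self, p: float) -> bool:
        """True if marker value p lies inside one of the validity intervals."""
        starts = [a for a, _ in self.intervals]
        i = bisect_right(starts, p) - 1  # last interval starting at or before p
        return i >= 0 and p < self.intervals[i][1]

    def at(self, p: float) -> int:
        """h[p]: the hash value at marker p, or the null hash outside the validity set."""
        return self.hash_value if self.is_valid(p) else NULL

edge = Key(42, [(0, 5), (8, 12)])
unmarked = Key(7, [(float("-inf"), float("inf"))])  # always valid

print([edge.at(p) for p in (-1, 0, 4.9, 5, 7, 8, 12)])
# [0, 42, 42, 0, 0, 42, 0]
print(unmarked.at(1e9))  # 7
```

The intervals are half-open, so a key is valid at its left endpoint and already invalid at its right endpoint.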



5 Marked Sets 

The M-Set, a container of marked keys, is the most powerful component of our framework. It can be 
thought of as a collection of marked objects, stored as representative keys, that permits easy access 
to useful information about the collection. The idea is that one can express many algorithmically 
complicated processing tasks involving dynamic data as simple operations on and between M-Set
objects. Efficient operations on an M-Set include querying, insertion, deletion, testing collection 
equality at specific marker values or over the whole collection, union and intersection, and extracting 
the collection of keys valid at specific marker values. 

5.1 Operations 

Available M-Set operations fall into five categories: element operations like insertion, querying, or
modifying an element's validity set; hash and testing operations like determining whether two M-Sets
are identical at marker m; set operations such as union and intersection; validity set operations
such as extracting all hashes valid at a certain point; and summarizing operations, which produce
representative hashes from one or more M-Sets. Of these, operations in the first four categories are
easily explained; we present them in the next sections. The summarizing operation, which is key to
the power of our framework, is presented in section 5.2. A list of all these functions is given in
section B.2; the most powerful ones we describe now.

5.2 Summarizing Operations: ReduceMSet and Summarize

The natural generalization of Reduce to M-Set objects, ReduceMSet, returns an M-Set containing
the reduction of every key in the set. Formally, for M-Sets $T_0$ and $T_1$, suppose $T_0$ = ReduceMSet($T_1$).
Then, for each marker value m, there is exactly one key in $T_0$ valid at m, with that hash being the
Reduce of every key in $T_1$ valid at m. The M-Set is the appropriate output of this function, as the
resulting hash value varies arbitrarily by marker value and thus cannot be expressed as a single key.
Because such an M-Set has exactly one hash (possibly 0) valid at each marker value, we use the same
bracket notation as keys to refer to that hash value, e.g. $T[m]$.

Looking up the reduced hash of the M-Set at specific marker locations - HashAtMarker - is 
efficient to do without reducing the entire set. The main use for ReduceMSet is thus to create a 
lookup of the possible values of Reduce in that set and when they are valid as represented by the 
validity sets of the resulting keys. This can, for example, be used to determine the set on which a 
dynamic collection is equal to a given collection. 
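To make the output of ReduceMSet concrete, here is a deliberately naive Python sketch that recomputes the reduction on every elementary interval between consecutive endpoints. It assumes Reduce is a sum modulo N and represents an M-Set as a plain dictionary from hash value to a list of half-open intervals. The names `reduce_mset` and `equal_to_hash` are hypothetical stand-ins, and the cost is quadratic rather than the efficient bound discussed in the text.

```python
N = 2**128 - 159  # prime modulus quoted in Appendix A


def valid_at(intervals, m):
    """True if m lies in one of the half-open intervals [a, b)."""
    return any(a <= m < b for a, b in intervals)


def reduce_mset(mset):
    """Return [(a, b, hash)]: the reduced hash that holds on each interval [a, b).

    Outside the smallest and largest endpoints no key is valid, so the
    reduced hash there is the null hash 0 and is not listed.
    """
    points = sorted({p for ivs in mset.values() for iv in ivs for p in iv})
    pieces = []
    for a, b in zip(points, points[1:]):
        # The set of valid keys is constant on [a, b), so evaluate it at a.
        value = sum(h for h, ivs in mset.items() if valid_at(ivs, a)) % N
        if pieces and pieces[-1][1] == a and pieces[-1][2] == value:
            pieces[-1] = (pieces[-1][0], b, value)   # extend an unchanged piece
        else:
            pieces.append((a, b, value))
    return pieces


def equal_to_hash(mset, target):
    """Intervals on which the reduction of the M-Set equals the hash `target`."""
    return [(a, b) for a, b, h in reduce_mset(mset) if h == target]


if __name__ == "__main__":
    toy = {11: [(0, 10)], 22: [(5, 15)]}
    print(reduce_mset(toy))
    print(equal_to_hash(toy, 33))
```

The second function illustrates the use mentioned above: once the reduced values are tabulated with their validity intervals, finding where a dynamic collection equals a given collection is a lookup.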

Just as ReduceMSet summarizes the information in a collection of keys by a single M-Set, so 
Summarize reduces the information from one or more distinct collections of M-Sets down to a single 
M-Set over which computations can accurately and efficiently be done. Changes in any individual 
collection, as well as which collections are included, are always reflected in the summarizing M-Set 
unless they fall under one of the invariant properties (e.g. 0 does not affect the outcome).

As such, Summarize produces an M-Set T in which one hash is valid at each given marker position
m. For $T = \mathrm{Summarize}(T_1, T_2, \ldots, T_n)$, the hash key valid at m, $T[m]$, is equal to

$$T[m] = \mathrm{Reduce}(\{\mathrm{Rehash}(h_i) : h_i = \mathrm{ReduceMSet}(T_i)[m],\ i = 1, 2, \ldots, n\}).$$

Given our implementation of ReduceMSet, described in the next section, the Summarize operation 
is very efficient and a central building-block in our framework. 
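A short worked example suggests why Summarize applies Rehash before the outer reduction. It uses only the composition invariance of Reduce (property (RD5), cited in section 5.3.2); the hashes $a$, $b$, $c$ and the two collections are hypothetical.

```latex
\begin{align*}
\mathrm{Reduce}\bigl(\mathrm{Reduce}(a, b),\ \mathrm{Reduce}(c)\bigr)
  &= \mathrm{Reduce}(a, b, c) \\
  &= \mathrm{Reduce}\bigl(\mathrm{Reduce}(a),\ \mathrm{Reduce}(b, c)\bigr).
\end{align*}
```

Without Rehash, the collections $\{\{a, b\}, \{c\}\}$ and $\{\{a\}, \{b, c\}\}$ would therefore receive the same summary. With it, the two summaries are $\mathrm{Reduce}(\mathrm{Rehash}(\mathrm{Reduce}(a, b)), \mathrm{Rehash}(\mathrm{Reduce}(c)))$ and $\mathrm{Reduce}(\mathrm{Rehash}(\mathrm{Reduce}(a)), \mathrm{Rehash}(\mathrm{Reduce}(b, c)))$, which the strong hash inside Rehash keeps distinct except with negligible probability.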

The summarizing operation is useful in that it allows us to efficiently test equality of collections of
M-Sets using the operations designed for hashes. For example, suppose we have two summary M-Sets,
$T = \mathrm{Summarize}(T_1, T_2, \ldots, T_n)$ and $U = \mathrm{Summarize}(U_1, U_2, \ldots, U_n)$. Given $T[p]$, the marker validity
set of the corresponding key in U, if any, gives the set in which the collection of $T_i$'s at p is equal to
the collection of $U_i$'s. Likewise, MarkerUnion over all objects in the intersection of T and U gives
the locations at which the two collections are equal.







Figure 3: A skip-list with 3 levels and 10 values. 

5.3 Implementation 

Internally, an M-Set is a combination of a hash table to store the hashes and a skip-list-type structure 
that handles the bookkeeping operations dealing with validity sets. This latter structure efficiently 
tracks the Reduce at each marker value of all keys present in the structure; this is key to making 
operations like equality testing and summarizing efficient. 



5.3.1 Skip Lists for Markers 

To introduce the augmented skip-list for the Reduce lookup, we first describe a simpler version
for holding marker information. In a skip-list, the values are stored in a single ordered linked list;
this allows for easy insertion and deletion, but by itself does not permit efficient access. To access
them efficiently, there are additional levels of increasingly sparse linked lists, each a subset of the
previous, with each node pointing forward and pointing down to the corresponding node in the lower
level. When a new value is inserted in the skip-list - in our case a validity interval - it also adds
corresponding nodes in the L levels above it, where $L \sim \mathrm{Geometric}(p)$. The geometric distribution of
the node "heights" means the expected size of level L is $np^L$. Overall, the expected time for querying,
insertion, or deletion is $O(\log n)$ [Devroye, 1992; Kirschenhofer and Prodinger, 1994; Papadakis et al.].

An example skip-list is shown in Figure 3. In this skip-list, each marker location, denoted by
$k_{m_1}, \ldots, k_{m_n}$, has corresponding nodes in 0 to 3 levels above it. The interval starting values are stored
in the nodes at each level.

Querying is done as follows. Start at the first node in the highest level, which is always at $-\infty$. If
a forward node exists and its value is less than or equal to the query value, move forward; otherwise,
move down. Repeat this until you are on the lowest level and cannot advance any farther; if this interval
contains the query value, that marker value is valid, otherwise it is not. Finding locations for insertion
and deletion is analogous.
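The query walk described above can be sketched as a small Python skip-list. This is a generic textbook variant, assumed for illustration: each node keeps an array of forward pointers instead of separate "down" pointers, and interval ends are held in a side dictionary. The names `SkipList`, `floor`, and `is_valid` are hypothetical.

```python
import math
import random


class Node:
    def __init__(self, value, height):
        self.value = value
        self.forward = [None] * height   # forward[i] is the next node on level i


class SkipList:
    def __init__(self, p=0.5, max_height=16, seed=0):
        self.p, self.max_height = p, max_height
        self.rng = random.Random(seed)
        self.head = Node(-math.inf, max_height)   # first node, always at -infinity

    def _random_height(self):
        height = 1
        while height < self.max_height and self.rng.random() < self.p:
            height += 1                  # geometric distribution of node heights
        return height

    def _predecessors(self, value):
        """On every level, the last node whose value is <= `value`."""
        preds = [self.head] * self.max_height
        node = self.head
        for level in reversed(range(self.max_height)):
            nxt = node.forward[level]
            while nxt is not None and nxt.value <= value:
                node, nxt = nxt, nxt.forward[level]   # move forward
            preds[level] = node                       # then move down
        return preds

    def insert(self, value):
        preds = self._predecessors(value)
        if preds[0].value == value:
            return                                    # already present
        node = Node(value, self._random_height())
        for level in range(len(node.forward)):
            node.forward[level] = preds[level].forward[level]
            preds[level].forward[level] = node

    def floor(self, query):
        """Largest stored value <= query, or -inf if there is none."""
        return self._predecessors(query)[0].value


def is_valid(starts, ends, m):
    """True if m falls inside the interval that begins at floor(m)."""
    start = starts.floor(m)
    return start in ends and m < ends[start]


if __name__ == "__main__":
    starts, ends = SkipList(), {}
    for a, b in [(0, 5), (10, 20), (30, 31)]:
        starts.insert(a)
        ends[a] = b
    print(starts.floor(12), is_valid(starts, ends, 12), is_valid(starts, ends, 7))
```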



5.3.2 Internal Reduce Lookup 

We now present the data structure that comprises the second component of an M-Set. This structure
allows for calculating Reduce, for a given marker value, over all components of the entire hash
collection in logarithmic time while still allowing logarithmic time insertion and deletion. With this
structure, we can also calculate Reduce over all validity set intervals in time linear in the
number of distinct validity interval endpoints, producing a new M-Set of keys with non-intersecting
validity sets representing the different values of Reduce at various marker intervals. These algorithmic
bounds and the associated algorithms will be formalized below.

Figure 4: The left part of the table in Figure 3 adapted to be an M-Set hash lookup structure holding hash
information in the skip-list nodes. Hash keys, valid for certain marker intervals, are shown in the bottom,
while the information stored in each node is shown below the node label.

The structure we propose is an augmented skip-list. The leaf and node values present in the tree 
correspond to the interval endpoints in the validity intervals of any key present, i.e. the marker values 
where the validity of any key changes. This list is augmented to hold a hash key in each leaf and each 
node. 

The main idea for the leaf hashes is to track Reduce over all valid keys as a function of marker
value m. Stepping through the leaves, starting with the hash value in the first leaf and updating it
with each subsequent leaf hash using Reduce, yields Reduce over all valid hashes at each marker value.
This is done by including a key's hash at the beginning of each of its validity intervals and the negative
of that hash at the end of each validity interval. Thus hash values are added and removed from the overall
reduced hash to maintain the value of Reduce at each marker value. This allows us, additionally, to calculate
the M-Set produced by ReduceMSet efficiently by simply stepping through the leaves.

Figure 4 shows an example augmented skip-list, the structure of which is taken from the right half
of the example skip-list in Figure 3. The leaves hold hash values that are the reduction of hash values
and/or negative hash values. At a given marker value, the value of Reduce over all previous leaves
is the value of Reduce over all currently valid keys. For example, looking at the first three leaves,

$$\begin{aligned}
\mathrm{Reduce}(k_{m_6}, k_{m_7}, k_{m_8})
  &= \mathrm{Reduce}(\mathrm{Reduce}(h_1, h_2),\ \mathrm{Reduce}(-h_1, h_3),\ \mathrm{Reduce}(-h_3)) \\
  &= \mathrm{Reduce}(h_1, h_2, -h_1, h_3, -h_3) \\
  &= \mathrm{Reduce}(h_2).
\end{aligned}$$

Formally, we maintain the following property: 

Property 5.1 (M-Set Marker Skip-List Leaf Hashes). For a given M-Set T with skip-list S, let

$$L[m] = \{h : m = \ell \text{ for some } [\ell, u) \text{ in the validity set of } h\}$$
$$U[m] = \{h : m = u \text{ for some } [\ell, u) \text{ in the validity set of } h\}$$

Then for all leaves in S, define the hash value $r_0[m]$ at that leaf to be

$$r_0[m] = \mathrm{Reduce}(\{h : h \in L[m]\} \cup \{-h : h \in U[m]\})$$

Thus we can formally state the above. 

Theorem 5.2. Let T be an M-Set with corresponding leaf hashes $r_0[\cdot]$. Let $R[m] = \mathrm{Reduce}(\{r_0[m'] : m' \le m\})$.
Then

$$R[m] = \mathrm{Reduce}(\{h[m] : h \in T \text{ and } h \text{ is valid at } m\}).$$

Equivalently,

$$R[m] = \mathrm{Reduce}(\{h[m] : h \in T\}).$$

Proof. On the marker intervals where a hash key is valid, its hash is included in R[m] exactly one time
more than its inverse is included, and it is included exactly the same number of times as its inverse
at all other values. From property (RD3), the summary Reduce hash at m depends on a hash key if
and only if that hash key is valid at m. The equivalent formula for R[m] follows immediately from
the fact that h[m] = 0 if h is not valid at m, which does not change R[m]. □
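Theorem 5.2 can be checked numerically on a toy M-Set. The sketch below assumes Reduce is a sum modulo N and stores an M-Set as a dictionary from hash value to half-open intervals; `leaf_hashes`, `prefix_reduce`, and `direct_reduce` are hypothetical names. It compares the running reduction of the leaf hashes $r_0$ against the reduction taken directly over the valid keys.

```python
N = 2**128 - 159  # prime modulus quoted in Appendix A


def leaf_hashes(mset):
    """r_0[m]: add h at the start of each validity interval, subtract it at the end."""
    leaves = {}
    for h, intervals in mset.items():
        for lo, hi in intervals:
            leaves[lo] = (leaves.get(lo, 0) + h) % N
            leaves[hi] = (leaves.get(hi, 0) - h) % N
    return leaves


def prefix_reduce(leaves, m):
    """R[m]: Reduce over every leaf hash whose marker value m' satisfies m' <= m."""
    return sum(r for point, r in leaves.items() if point <= m) % N


def direct_reduce(mset, m):
    """Reduce over the keys that are valid at m, computed from the definition."""
    return sum(h for h, ivs in mset.items()
               if any(lo <= m < hi for lo, hi in ivs)) % N


if __name__ == "__main__":
    toy = {101: [(0, 10)], 202: [(5, 15), (20, 25)], 303: [(10, 20)]}
    leaves = leaf_hashes(toy)
    print(all(prefix_reduce(leaves, m) == direct_reduce(toy, m)
              for m in range(-2, 30)))
```

The linear scan in `prefix_reduce` is exactly the cost that the node hashes introduced next are meant to avoid.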

As mentioned, the Reduce of all leaf hash values whose associated marker value is less than or
equal to the given marker value m is equal to the Reduce of all the keys in the M-Set valid at m.
However, storing the hashes in this way at the leaves is not enough to compute Reduce efficiently
over the full hash table at a given marker value m, as it would require visiting every change-point
present that is no greater than m. One might suggest storing R[m], the full value of Reduce, at the
leaves instead of just $r_0[m]$, but then insertion and deletion would require time linear in the number
of marker points present in valid intervals, and this can be arbitrarily large.

Our solution is to store a hash value summarizing blocks of $r_0[m]$ in the nodes at higher levels
in the skip-list structure: the hash value stored in a node is the reduction of all leaves under it.
The presence of these hash values at the nodes allows us to construct logarithmic
time algorithms for querying, insertion and deletion, because we can include the reduction of
large blocks of leaves with a single operation as we travel down the skip-list.

Formally, at the nodes, these hash values maintain the following property: 
Algorithm 1: HashAtMarkerValue
Input: M-Set T and marker value m.
Output: Hash value h, the Reduce over all hash objects in T at m.
h <- 0
n <- First node of highest level of skip-list of T
while not at destination leaf do
    if next node n' has marker value <= m then
        h <- Reduce(h, hash at node n)
        n <- n'
    else
        n <- node below n
return h



Property 5.3 (M-Set Skip-List Node Property). Let $r_b[m]$ be a hash value at marker value m in the
b-th level of the skip-list with $b \ge 1$ ($b = 0$ is the leaf level). Let $m'$ be the smallest marker value larger
than m, possibly $\infty$, such that there exists a node at level b with marker value $m'$. Then

$$r_b[m] = \mathrm{Reduce}(\{r_{b-1}[m''] : m \le m'' < m'\}).$$

Equivalently,

$$r_b[m] = \mathrm{Reduce}(\{r_0[m''] : m \le m'' < m'\}) \qquad (8)$$

In other words, the hash value of a node at level b, $b \ge 1$, with marker value m is the Reduce over
all nodes at level $b-1$ whose marker value is at least m and less than the marker value of the
next node at level b. Property (RD5) of the reduce function - invariance under composition - means
that the hash value stored in a node is the Reduce over all the leaf values beneath it, i.e. reaching
such a leaf requires passing through that node. This yields the equivalent formula (8).

In Figure 4, the hash at node c2 is the Reduce over the hash at nodes b3 and b4; the hash at
node b3 is the Reduce over the hash at nodes a4 and a5, and so on. The net result of this is that
the hash at each node is the Reduce of all the hash values beneath it.

This allows us to calculate Reduce at any marker value in logarithmic time using Algorithm 1.
This algorithm differs from the regular skip-list query algorithm only in that it updates a running hash
as it traverses sideways. This means that at each point, the current hash h includes the reduction of
all leaf nodes prior to the current marker value, i.e. before moving forward, the reduction of all leaf
hashes between the current node and the next node is included in Reduce. This last statement is
sufficient to prove the validity of the algorithm.

The algorithms for insertion and deletion are similar but involve more detailed bookkeeping to
handle the creation and deletion of nodes. Apart from this, the only difference from Algorithm 1
is that the hash at the node is updated when moving down, rather than across; this preserves the
invariant that the hash at a given node is the Reduce of all the hash values stored under it.
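The node property and the traversal of Algorithm 1 can be exercised together on a static structure. The Python sketch below is an assumption-laden illustration: levels are built once from finite interval endpoints (no insertion or deletion), Reduce is taken to be a sum modulo N, and the names `build_levels` and `hash_at_marker` are hypothetical. After the loop the sketch folds in the destination leaf's own hash, which is how it reads the requirement in Theorem 5.2 that leaves with $m' \le m$ are included.

```python
import math
import random
from bisect import bisect_left

N = 2**128 - 159  # prime modulus quoted in Appendix A


def leaf_hashes(mset):
    """r_0[m]: +h at each interval start, -h at each interval end (mod N)."""
    leaves = {}
    for h, intervals in mset.items():
        for lo, hi in intervals:
            leaves[lo] = (leaves.get(lo, 0) + h) % N
            leaves[hi] = (leaves.get(hi, 0) - h) % N
    return leaves


def build_levels(leaves, p=0.5, seed=0):
    """Level b is (markers, hashes); each hash reduces the leaves in
    [marker, next marker on level b). Level 0 is the leaf level."""
    rng = random.Random(seed)
    points = sorted(leaves)
    height = {}
    for m in points:
        height[m] = 1
        while height[m] < 32 and rng.random() < p:
            height[m] += 1
    levels = []
    for b in range(max(height.values(), default=1)):
        markers = [-math.inf] + [m for m in points if height[m] > b]
        bounds = markers[1:] + [math.inf]
        hashes = [sum(r for x, r in leaves.items() if lo <= x < hi) % N
                  for lo, hi in zip(markers, bounds)]
        levels.append((markers, hashes))
    return levels


def hash_at_marker(levels, m):
    """Walk from the top-left node, accumulating node hashes when moving forward."""
    h, b, i = 0, len(levels) - 1, 0
    while True:
        markers, hashes = levels[b]
        if i + 1 < len(markers) and markers[i + 1] <= m:
            h = (h + hashes[i]) % N          # everything beneath node i
            i += 1                           # move forward
        elif b > 0:
            here = markers[i]
            b -= 1
            i = bisect_left(levels[b][0], here)   # move down to the same marker
        else:
            return (h + hashes[i]) % N       # include the destination leaf


if __name__ == "__main__":
    toy = {101: [(0, 10)], 202: [(5, 15), (20, 25)], 303: [(10, 20)]}
    levels = build_levels(leaf_hashes(toy), seed=3)
    print(hash_at_marker(levels, 12) == 202 + 303)
```

Only the query is logarithmic here; building each level by rescanning the leaves is a shortcut for brevity and is not how a dynamic skip-list would maintain the node hashes.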

6 Example 

We now return to the motivating example, IBD graphs, given in section 1. The individuals in this
case are edges, which are assumed to be unique; the labels on the nodes are unidentifiable, requiring
any testing functions to be invariant to them.
Algorithm 2: IBD Graph Summarizing
Input: An IBD graph G.
Output: T, an M-Set summarizing G; the hash of T at a marker point m is invariant under
        permutations of the node labels.
L <- Empty list
for each node n in G do
    T <- Empty M-Set
    for each edge e attached to n on interval [t_1, t_2) do
        h <- Hash(e)
        /* Set [t_1, t_2) with key h to be valid in T. */
        AddValidRegion(T, h, t_1, t_2)
    /* Append this new M-Set T to our list. */
    append T to L
/* The hash representation of the graph is the summary of all the node M-Sets in L. */
T <- Summarize(L)
return T



Algorithm 3: IBD Graph Unique Elements
Input: S_1, S_2, ..., S_n, M-Sets summarizing n IBD graphs.
Output: L, a list of (h, i, m) tuples giving a reference hash, an index, and a marker
        location denoting one instance of each unique graph in the original collection.
/* Form a single table of all unique graph hashes. */
H <- Union(KeySet(S_1), KeySet(S_2), ..., KeySet(S_n))
/* Go through and find one index and marker value where each of these graphs
   occur. H tracks the graph hashes yet to be recorded. */
L <- Empty list
for i = 1 to n do
    S <- Intersection(S_i, H)
    foreach h in S do
        append (h, i, VSetMin(h)) to L
        Pop(H, Key(h))
return L



The main idea is to represent each node as an M-Set with keys representing edge labels. The
validity set on each key denotes when that edge is attached to the node; this allows the structure
of the graph to change over marker location. With each node represented this way, the entire graph
can be represented as the summary of the node M-Sets. At each marker point, this computes a hash
over each edge within a node using Reduce, rehashes the result to freeze invariants, then computes
a final hash over the resulting collections. Per the guarantees of Reduce and Rehash (definitions
3.4 and 3.5), the resulting hashes of two graphs will match if and only if all the nodes have identical
edges, which is true if and only if the two graphs are equivalent (ignoring the completely negligible
probability of hash intersections).
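The invariance argument above can be tried on toy data. The following Python sketch evaluates a graph hash at a single marker value, assuming Reduce is a sum modulo N and using BLAKE2b as a stand-in for the strong hash. The graph encoding (a list of nodes, each mapping an edge label to its attachment intervals), the edge labels, and the function names are hypothetical.

```python
import hashlib

N = 2**128 - 159  # prime modulus quoted in Appendix A


def strong_hash(data: bytes) -> int:
    """Stand-in strong hash into {0, ..., N-1} (BLAKE2b, not CityHash)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "big") % N


def rehash(h: int) -> int:
    """Rehash(h) = Reduce(Hash(h), -Hash(0)), with Reduce assumed to be a sum mod N."""
    zero = strong_hash((0).to_bytes(16, "big"))
    return (strong_hash(h.to_bytes(16, "big")) - zero) % N


def graph_hash_at(graph, m):
    """Reduce over nodes of Rehash(Reduce over the edges attached at marker m)."""
    total = 0
    for node in graph:
        node_hash = sum(strong_hash(edge.encode())
                        for edge, intervals in node.items()
                        if any(t1 <= m < t2 for t1, t2 in intervals)) % N
        total = (total + rehash(node_hash)) % N
    return total


if __name__ == "__main__":
    g1 = [{"e1": [(0, 10)], "e2": [(0, 5)]},
          {"e3": [(0, 10)], "e2": [(5, 10)]}]
    g2 = list(reversed(g1))                      # same graph, nodes relabeled
    g3 = [{"e1": [(0, 10)]},
          {"e2": [(0, 10)], "e3": [(0, 10)]}]
    print(graph_hash_at(g1, 3) == graph_hash_at(g2, 3))   # node order is irrelevant
    print(graph_hash_at(g1, 3) == graph_hash_at(g3, 3))   # different grouping of edges
    print(graph_hash_at(g1, 7) == graph_hash_at(g3, 7))   # e2 has moved by marker 7
```

In the toy data, edge `e2` switches nodes at marker 5, so `g1` differs from `g3` before that point and coincides with it afterwards.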

Our first illustration, given in Algorithm 2, simply tests if two graphs are equal. It also illustrates
how to set up the original graphs from a simple list-of-lists form. Beyond this, we are also interested
in all the unique graphs present in a collection of node M-Sets. Assuming these are summarized by
$S_1, S_2, \ldots, S_n$ as in Algorithm 2, we can use Algorithm 3 to find a list of specific indices and marker
locations that enumerate the unique graphs. Algorithms that need to be run, in theory, at each marker
value can instead be run only at this set of points.
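The enumeration in Algorithm 3 can be sketched in Python with ordinary dictionaries and sets standing in for M-Sets. Each summary is assumed to map a graph hash to the intervals on which that graph configuration occurs; the function name `unique_graphs` and the toy hashes are hypothetical.

```python
def unique_graphs(summaries):
    """Return (hash, index, marker) for one instance of every distinct graph hash.

    `summaries[i - 1]` plays the role of S_i: a dict from graph hash to the
    sorted half-open intervals on which that configuration is valid.
    """
    # Table of all graph hashes still to be recorded (H in Algorithm 3).
    remaining = set()
    for summary in summaries:
        remaining |= summary.keys()

    found = []
    for i, summary in enumerate(summaries, start=1):
        # Intersection(S_i, H): hashes in this summary not yet recorded.
        for h in sorted(summary.keys() & remaining):
            first_marker = min(a for a, _ in summary[h])   # VSetMin
            found.append((h, i, first_marker))
            remaining.discard(h)                           # Pop(H, Key(h))
    return found


if __name__ == "__main__":
    s1 = {111: [(0, 40)], 222: [(40, 100)]}
    s2 = {222: [(0, 70)], 333: [(70, 100)]}
    s3 = {111: [(0, 100)]}
    print(unique_graphs([s1, s2, s3]))
```

In this toy run, five graph configurations across three summaries collapse to three distinct hashes, each reported with one index and marker value at which it occurs.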



7 Experiments and Benchmarks 

To demonstrate the effectiveness of this approach, Table 2 presents computation times on several real
and simulated IBD graph collections along with the savings obtained by avoiding redundant operations.
The experiments were all run on an Intel Xeon E5-4640 processor running at 2.40 GHz. Recall that
the motivating computations to be run on the unique graphs (described in section 1) can take hours when
run on a collection of these graphs, so even a small reduction factor gives a significant time savings
and easily absorbs the preprocessing time shown here. Total Graph Configurations is the number of
potentially different graphs over which a computation needs to be run. On a single graph, it is the
total number of intervals on which there is no recorded change in the graph; for multiple graphs, it
is this factor summed over all graphs. Unique Graphs is the number of unique configurations within
this set; running computations only on each of these is sufficient.



Dataset      Number      Individuals   Total Graph      Unique    Speedup   Computation
             of Graphs   per Graph     Configurations   Graphs    Factor    Time
Iceland-1    1000        95            155,612          150,290   1.04      2.18s
Iceland-2    1000        31            67,809           1,179     57.5      0.99s
Iceland-3    30000       31            1,616,028        1,376     1174.4    12.16s
fglhaps-7    1           7000          92,488           92,483    1.00005   10.39s

Table 2: Results and processing times for Algorithm 3 on several IBD graph datasets.
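The Speedup Factor column in Table 2 appears to be the ratio of Total Graph Configurations to Unique Graphs. The short Python check below recomputes that ratio from the tabulated counts; the interpretation of the column is an inference from the numbers, and the variable names are illustrative.

```python
# (total graph configurations, unique graphs) as listed in Table 2
rows = {
    "Iceland-1": (155_612, 150_290),
    "Iceland-2": (67_809, 1_179),
    "Iceland-3": (1_616_028, 1_376),
    "fglhaps-7": (92_488, 92_483),
}

# Speedup is taken to be the number of configurations per unique graph.
speedup = {name: total / unique for name, (total, unique) in rows.items()}

if __name__ == "__main__":
    for name, factor in speedup.items():
        print(f"{name}: {factor:.5f}")
```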



Table 2 shows results for four examples. The three Iceland datasets consist of IBD graphs realized
conditionally on marker data. The marker data are simulated on a pedigree structure described in
Glazner and Thompson [2012]. Iceland-1 IBD graphs contain a full set of 95 related individuals over
12 generations, while the graphs of Iceland-2 and Iceland-3 are of a reduced set of individuals in the
last 3 generations for whom marker and trait data were assumed available. The fglhaps-7 example is a
single IBD graph with 7000 individuals and marker indexing from 1 to 140 million. This graph results
from simulation of descent of a population of 7000 individuals over 200 generations [Brown et al.,
2012].



For the full Iceland graph on 95 individuals, Iceland-1, there is little reduction in the number of
graphs. However, for the subset of 31 observed individuals for whom the probability $P(Y_T \mid Z; \gamma, \Gamma)$
must be computed (equation (1)), there is a greater than 50-fold reduction even for only 1000
realizations of the IBD graph. When the number of realizations is increased to 30,000, the speedup is 3
orders of magnitude, while the time to process the IBD graphs increases only from 0.99s to 12.16s.
On the single graph of the fglhaps-7 example, there is little reduction from running the software, since
there are few marker intervals where the IBD graph is repeated. However, this large graph is still processed
by the software in a relatively negligible 10.39s.

In addition to this, Figure 5 shows the computational results from a simulation study of descent
of chromosomes of length $10^8$ base pairs over multiple population sizes, numbers of realizations,
numbers of generations, and recombination rates. As can be seen, for smaller population sizes, there
is a substantial speed improvement, often several orders of magnitude or more. Furthermore, as
the number of realized IBD graphs in a collection increases, disproportionately more redundancies are
found, while the time required to compute the equivalence classes scales linearly. This indicates that
even if our method takes several minutes to run - the most time taken in these simulations - it is
always worthwhile.

(a) Percentage speed improvement by redundant computations eliminated.
(b) Processing time required to compute equivalence classes.

Figure 5: Results showing computation savings in simulated IBD graphs generated from populations of
4, 10, 20, and 40 individuals (lines), with descent over 5, 10, 50, and 100 generations (columns). The
recombination rates per generation are $1.0 \times 10^{-8}$ per base pair (top rows) and $5.0 \times 10^{-8}$ (bottom rows), with each
individual a pair of chromosomes of length $10^8$ base pairs. Results are shown for the IBD graphs of the final
generation (x-axes), for sets of 100 to 100,000 realizations.

These examples illustrate the power of our framework in working with these types of dynamic data. 
The advantage of M-Sets and the given operations can be seen easily; many redundant operations can 
be eliminated. Not surprisingly, these gains are the most substantial on small graphs involving only 
a few individuals. However, even in the case where there is little reduction (e.g. fglhaps-7), the time 
taken to process the equivalence classes is negligible relative to the rest of the computations. It should 
be noted that Iceland-2 and Iceland-3 showed the most dramatic reduction in processing time. The 
Iceland examples are those where the multiple IBD graphs are realizations estimating a single true 
latent IBD graph, and are generated conditional on genetic marker data. The variation among graphs 
is therefore much less than in the independent realizations of descent in the other examples. These 
Iceland examples demonstrate the significant computational speed-ups that are possible in practice. 



8 Conclusion 

The representation of objects as hashes permits efficient set operations, which in turn allows many
testing algorithms to be expressed in terms of these operations. On more complex data, summarizing
and reduction operations allow data types with nested representations to also work with this framework.
This is especially true in the target structure, the IBD graph, in which otherwise complex and
slow tests can be expressed as simple and intuitive operations. Finally, we showed that real world
operations can have substantial speed improvements when using our framework to eliminate redundant
operations.

The authors wish to thank Lucas Koepke for his contributions to the code base, Steven Lewis for
rigorously testing it, and Chris Glazner for help with the experiments. This open source library is
freely available online at http://www.stat.washington.edu/~hoytak/code/hashreduce



Appendix A Hash Function 



The Hash function we use is CityHash [Google, 2011], which produces a strong (though not
cryptographic) 128 bit hash. We map the resulting hash to {0, 1, ..., N}, with the upper number chosen to
be prime. In our case, we use $N = 2^{128} - 159$ as it is the largest prime that can be represented by a
128 bit integer.



Appendix B Available Operations 

We here give a list of operations that are efficiently implemented in our library.

B.1 Validity Set Operations

To work with validity sets, we introduce several operations. These can be broken into two categories:
operations that act directly on the validity set of a key and operations that work between validity sets.
The former includes operations for constructing and manipulating a validity set, testing whether a key
is valid at a given marker value, and iterating through a key's validity set intervals. The latter class
implements set operations. These operations all accept keys or marker validity sets as arguments
and return a key or validity set resulting from the respective operation.
IsValid(X, m)

Returns true if m is a valid point in the validity set or hash object X and false otherwise.

GetVSet(h)

Returns the validity set of a hash key h.

SetVSet(h, M)

Sets the validity set of a hash key h to M.

AddVSetInterval(X, a, b), ClearVSetInterval(X, a, b)

Marks the interval [a, b), a < b, as valid or invalid, respectively, in the validity set or hash object
X.

VSetUnion(X, Y), VSetIntersection(X, Y), VSetDifference(X, Y)

Takes the set union, intersection, or difference between two validity sets or hash objects X and
Y, returning the result as a hash object if both X and Y are hash objects, and as a validity set
otherwise.

VSetMin(X)

Returns the lowest valid marker value m.

VSetMax(X)

Returns the greatest marker value m such that there are no valid regions greater than m.
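As a rough illustration of the set operations listed above, the following Python sketch implements union, intersection, and difference on validity sets stored as lists of half-open intervals. The snake_case names are hypothetical stand-ins for VSetUnion, VSetIntersection, and VSetDifference, and the quadratic loops favor clarity over the efficiency of the library's implementation.

```python
def normalize(intervals):
    """Sort half-open intervals and merge any that overlap or touch."""
    merged = []
    for a, b in sorted(intervals):
        if a >= b:
            continue                                   # skip empty intervals
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


def vset_union(x, y):
    return normalize(list(x) + list(y))


def vset_intersection(x, y):
    out = []
    for a, b in x:
        for c, d in y:
            lo, hi = max(a, c), min(b, d)
            if lo < hi:                                # keep non-empty overlaps
                out.append((lo, hi))
    return normalize(out)


def vset_difference(x, y):
    result = normalize(x)
    for c, d in normalize(y):
        # Removing [c, d) from [a, b) leaves a piece on the left and one on the right.
        result = [piece
                  for a, b in result
                  for piece in ((a, min(b, c)), (max(a, d), b))
                  if piece[0] < piece[1]]
    return result


if __name__ == "__main__":
    print(vset_union([(0, 5)], [(3, 8), (10, 12)]))
    print(vset_intersection([(0, 5), (10, 20)], [(3, 12)]))
    print(vset_difference([(0, 10)], [(3, 5)]))
```

Keeping the intervals sorted and disjoint after every operation preserves the representation the validity set definition requires, so the lowest valid marker value is simply the start of the first interval.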

B.2 M-Set Operations

These operations are all efficiently implemented using the previously described algorithms.

B.2.1 Element Operations

Exists(T, k)

Returns true if a key with hash k exists in T, and false otherwise.

ExistsAt(T, k, m)

Returns true if a key with hash k exists in T and is valid at marker value m, and false otherwise.

Get(T, k)

Retrieves any key having hash k from T.

Insert(T, h)

Inserts the key h into T.

AddValidRegion(T, h, $t_1$, $t_2$)

Sets the region $[t_1, t_2)$ with key h to be valid in T. If h is already present in the table, $[t_1, t_2)$ is
set to be valid in that key's V-Set; otherwise, h is given the V-Set $[t_1, t_2)$ and inserted into T.

Pop(T, k)

Removes any key having hash k from T and returns it.

B.2.2 Hash and Testing Operations

HashAtMarker(T, m)

Returns the hash formed by Reduce over all the keys valid at marker value m.

EqualAtMarker($T_1, T_2, \ldots, T_n$, m)

Returns true if all M-Sets $T_1, T_2, \ldots, T_n$ contain the same set of keys at marker m, and false
otherwise.

EqualityVSet($T_1, T_2, \ldots, T_n$)

Returns a marker validity set indicating where all M-Sets are equal.

EqualToHash(T, h)

Returns a validity set indicating the marker locations on which the reduction of T is equal to
the hash h.

B.2.3 Set Operations 

Union($T_1, T_2, \ldots, T_n$)

Returns an M-Set containing the union over all keys. For each hash value, the new marker
validity set is the union of the validity sets of all keys having that hash.

Intersection($T_1, T_2, \ldots, T_n$)

Returns an M-Set containing the keys present in all input M-Sets, with the new validity set
being the intersection of the originals' validity sets. Objects with no valid regions are discarded.

Difference($T_1$, $T_2$)

Returns an M-Set containing all keys from $T_1$, with the validity set of any corresponding
key in $T_2$ removed. Keys with empty validity sets are dropped.

B.2.4 Marker Validity Set Operations 

MarkerUnion(T, M) 

Returns a new M-Set formed by all the keys in T, where the new validity sets are the union of 
the original and M. 

MarkerIntersection(T, M) 

Returns a new M-Set formed by all the keys in T, where the new validity sets are the intersection 
of the original and M. Keys with empty validity sets are dropped. 

Snapshot(T, m) 

Takes a "snapshot" of the M-Set at a given marker value, returning an M-Set of all the hashes 
valid at that marker value. 

KeySet(T) 

Returns a new M-Set in which all keys in T valid at any marker point in T are returned as an
unmarked set. Equivalent to MarkerUnion(T, $[-\infty, \infty)$).

UnionOfVSets(T) 

Returns a validity set M formed by taking the union of the validity set of every non-null key 
present in T. 

IntersectionOfVSets(T) 

Returns a validity set M formed by taking the intersection of the validity set of every non-null 
key present in T. 